Data Set


Red Wine Quality

Which chemical properties influence the quality of red wines? In this project we’ll try to answer this question by exploring the red wine data set.

Univariate Plots Section

Data Overview

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Some initial observations here:

  • There are 1599 observations of 13 variables.
  • quality is an ordered, discrete variable. The median rating is 6 on a 10-point scale, and at least 75% of wines are rated 6 or below.
  • density appears to have a small amount of variance, while it looks like there is much more variance in residual.sugar and chlorides.
  • The minimum value of citric.acid is 0.
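The quartile figures in the summary can be reproduced on any column with a few lines of code. The following is a minimal Python sketch, not the code used for this report (which was produced in R). It uses only the first ten alcohol values shown in the structure listing above, and it assumes that linear interpolation between order statistics (method="inclusive") is the quantile rule behind the summary output.

```python
import statistics

# First ten alcohol values, copied from the structure listing above.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Quartiles use linear interpolation between order statistics
    (an assumption about how the report's summary was computed).
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


summary = five_number_summary(alcohol)
print(summary)
```

With only ten rows the numbers will differ from the full-data summary; the point is the shape of the computation.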

Explore Variable Distributions

Now let’s look at the distributions of the variables.

Some observations on these:

  • The distributions of volatile.acidity, density and pH look nearly normal.
  • All other features appear to be positively skewed.
  • Qualitatively, residual.sugar and chlorides have extremely long tails.
  • citric.acid appears to have a large number of zero values.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar values are highly concentrated around 2.2 (the median), with some outliers in the higher ranges.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

We see a similar distribution with chlorides. It peaks at around 0.079 (the median).

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Number of zero-values:

## [1] 132

This is a really strange distribution: about 8% (132/1599) of the wines contain no citric acid at all.
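The zero-value share is simple arithmetic, sketched below in Python for illustration (the report itself was produced in R). The helper name zero_share is hypothetical. The sample uses the first ten citric.acid values from the structure listing, and the full-data figure uses the counts reported above.

```python
# First ten citric.acid values, copied from the structure listing above.
citric_acid = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]


def zero_share(values):
    """Return (number of exact zeros, share of zeros as a percentage)."""
    zeros = sum(1 for v in values if v == 0)
    return zeros, 100 * zeros / len(values)


sample_zeros, sample_pct = zero_share(citric_acid)

# Full data set: 132 zero values out of 1599 observations (counts reported above).
full_pct = 100 * 132 / 1599

print(sample_zeros, sample_pct, round(full_pct, 1))
```

The first ten rows happen to be much richer in zeros than the data set as a whole, so the sample share should not be read as representative.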

Univariate Analysis


What is the structure of your dataset?

There are 1599 observations of 13 variables in the red_wine data set: an index column (X), 11 chemical properties, and quality.

What is/are the main feature(s) of interest in your dataset?

I’m most interested in quality and how the other variables affect it. Quality is scored between 0 and 10, but the observations only range from a minimum of 3 to a maximum of 8, with an average of 5.636.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I won’t be sure until I look at the correlations between variables and some bivariate plots, but volatile.acidity, citric.acid and alcohol all seem to be related to the taste of wine.

Did you create any new variables from existing variables in the dataset?

Not yet.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some variables, such as residual.sugar and chlorides, have long-tailed distributions. I also noticed that about 8% of citric.acid values are zero.

I haven’t performed any operations yet.

Bivariate Plots Section


Correlations

Quantitatively, the following variables have the strongest correlations with quality:

  • alcohol: 0.48
  • volatile acidity: -0.39
  • sulphates: 0.25
  • citric acid: 0.23

Strong correlations between other variables:

  • fixed acidity & pH: -0.68
  • fixed acidity & citric acid: 0.67
  • fixed acidity & density: 0.67
  • free sulfur dioxide & total sulfur dioxide: 0.67
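For reference, here is a minimal Python sketch of how a correlation coefficient like the ones listed above can be computed. It assumes the coefficients are Pearson correlations (the report does not say which method was used), and it is not the code behind the report, which was written in R. The sample data are the first ten alcohol and quality values from the structure listing.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First ten rows only, copied from the structure listing above.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

r = pearson(alcohol, quality)
print(round(r, 3))
```

Ten rows are far too few to reproduce the reported coefficients; the sketch only shows the calculation.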

Let’s see more details.

Relationship with Wine Quality

Alcohol

Among all features alcohol has the strongest correlation with red wine quality (0.476).

The wines rated as 3 all have alcohol values less than or equal to 11%, while roughly 75% of wines rated as 7 or 8 have alcohol values greater than 11%.

With all six quality levels, the plots start looking messy. I created a categorical variable rating, classifying the wines as low (rating 0 to 4), medium (rating 5 and 6), and high (rating 7 to 10).

##    low medium   high 
##     63   1319    217
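The bucketing rule described above can be sketched as follows. This is an illustrative Python version, not the original R code, and the function name rating is simply chosen to match the variable name in the text. The sample input is the first ten quality values from the structure listing.

```python
from collections import Counter


def rating(quality):
    """Map a 0-10 quality score to low (0-4), medium (5-6) or high (7-10)."""
    if quality <= 4:
        return "low"
    if quality <= 6:
        return "medium"
    return "high"


# First ten quality values, copied from the structure listing above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

counts = Counter(rating(q) for q in quality)
print(counts["low"], counts["medium"], counts["high"])
```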

We see that low and medium quality wines become less common as alcohol levels increase, while high quality wines become more common at higher alcohol levels.

There is a clear positive relationship between alcohol and quality. This makes sense, since higher alcohol content would be related to a higher concentration of flavor. Wines with lower alcohol concentrations would likely have more of a “watery” mouthfeel and might not be perceived as being of high quality.

Acidity

Volatile acidity has the second strongest correlation with wine quality, and it is negative (-0.391).

I added jitter and transparency to prevent overplotting. It definitely looks like there is a negative correlation between the two.

The trend is very clear: the lower the volatile acidity level, the higher the wine quality. This makes sense, since too high a volatile acidity level can lead to an unpleasant, vinegar-like taste.

Now let’s look at fixed acidity, which has a much weaker correlation with wine quality (0.12).

As expected, the correlation is not as obvious as it is between volatile acidity and quality. How about TA (total acidity), the combination of fixed acidity and volatile acidity?
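A combined acidity variable could be derived as in the Python sketch below. It assumes that "combination" means a simple row-wise sum of the two columns, which the report does not state explicitly, and the name total_acidity is hypothetical. The inputs are the first ten values of each column from the structure listing.

```python
# First ten values of each column, copied from the structure listing above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
volatile_acidity = [0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5]

# Assumption: total acidity is the row-wise sum of fixed and volatile acidity.
# Rounding to 2 decimals only hides floating-point noise in the printed values.
total_acidity = [round(f + v, 2) for f, v in zip(fixed_acidity, volatile_acidity)]

print(total_acidity)
```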

Well, maybe there is a trend, but it is still not as clear as with volatile acidity. This is not a surprise, since the taste of wine is much more complex: different types of acid affect how we perceive it. For example, during the ageing of Chardonnay, malic acid is gradually converted to lactic acid, and the sharp acidic taste becomes smoother.

Sulphates

Sulphates has the third strongest correlation with quality (0.25). This coefficient is not very strong, but let’s have a look anyway.

Here again I added jitter and some transparency to prevent overplotting. There does appear to be a trend toward higher sulphate levels in higher rated wines. But there also are a large number of outliers for the wines rated as 5 or 6.

There is a long tail! Maybe we should try to take a log.

It’s much better. Let’s take a look at the correlation.

##       cor 
## 0.3086419

This is higher than the previous 0.25, which makes the variable more meaningful for wine quality.
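One way to see why the log helps is to measure how far the maximum sits from the median, in units of the interquartile range, before and after the transform. The Python sketch below does this with the sulphates quartiles from the summary table. It is an illustration rather than part of the original analysis, the helper name tail_in_iqrs is hypothetical, and it assumes that the quartiles of the log-transformed values are close to the logs of the quartiles.

```python
import math

# Sulphates summary values, copied from the summary table above.
q1, median, q3, maximum = 0.55, 0.62, 0.73, 2.0


def tail_in_iqrs(q1, median, q3, maximum, transform=lambda v: v):
    """Distance from median to maximum, expressed in interquartile ranges."""
    t = transform
    return (t(maximum) - t(median)) / (t(q3) - t(q1))


raw_tail = tail_in_iqrs(q1, median, q3, maximum)
log_tail = tail_in_iqrs(q1, median, q3, maximum, transform=math.log10)

print(round(raw_tail, 2), round(log_tail, 2))
```

A smaller value on the log scale means the extreme observations sit closer to the bulk of the data, so they distort a linear correlation less.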

Citric Acid

Now let’s look at citric acid and quality, which have a correlation coefficient of 0.23. This is not ideal either.

There is a large amount of variance in these values, but I can see a positive trend: the median citric acid value increases steadily with each successive quality rating, from 0.035 g/dm3 for wines rated as 3 up to 0.420 g/dm3 for wines rated as 8.
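Per-rating medians like these can be computed by grouping the values by quality score. The following Python sketch is illustrative only (the report was produced in R) and uses a hypothetical helper, median_by_group, on the first ten rows from the structure listing.

```python
import statistics
from collections import defaultdict

# First ten rows only, copied from the structure listing above.
citric_acid = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]


def median_by_group(values, groups):
    """Return {group: median of the values in that group}, sorted by group."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {g: statistics.median(vs) for g, vs in sorted(buckets.items())}


medians = median_by_group(citric_acid, quality)
print(medians)
```

With so few rows the medians here do not show the trend described above; the full data set is needed for that.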

We see that a lot of wines have a low citric acid concentration, including highly rated wines. This is consistent with our previous exploration, which showed that about 8% of the wines contain no citric acid at all. In contrast to volatile acidity, citric acid adds freshness to the wine, but I don’t think it is a necessary feature of a quality wine.

pH

Here, we’ll take a look at pH, which has one of the weakest correlations with quality (-0.058).

Does this mean that the pH level is meaningless for wine quality?

I don’t think so. With an appropriate pH level, the wine presents a better color and bacterial growth is kept under control; together with TA (total acidity), pH also gives a first indication of the taste and style of a wine. This feature is so important that every winemaker pays attention to it.

Also, our samples contain many more average wines than excellent or poor ones. As the plot shows, most wines have a pH between 3.2 and 3.4, which is already an appropriate range for red wines.

Residual Sugar

Finally, I’d like to look at quality and residual sugar plotted against each other. Their correlation is also among the weakest (0.014).

Wow, it has such a small amount of variance! But it does make sense. As we know, wine can be categorised by sweetness into several types: dry, medium, sweet and so on. Each type of wine can be good or bad, so this variable does not seem to be a useful measure of wine quality.

Other Relationships

The following 4 combinations have the strongest overall correlations in the data set.

  • fixed acidity & pH: -0.682
  • fixed acidity & citric acid: 0.672
  • fixed acidity & density: 0.668
  • free sulfur dioxide & total sulfur dioxide: 0.667

Some correlations are positive, some are negative. For me, these are all reasonable relationships.

Bivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As for the main feature of interest, quality has relatively strong correlations with 3 of the features: alcohol, volatile.acidity and log(sulphates).

alcohol has the strongest correlation with red wine quality (0.476), and the plots show a clear positive relationship between the two. Other than a slight dip for wines rated as 5, the median alcohol value increased steadily with each rating.

volatile.acidity has a negative correlation with red wine quality (-0.391). The variance decreased with each increase in rating.

Like alcohol, sulphates has a positive correlation with quality (0.251), but there are also a large number of outliers for the wines rated as 5 or 6. Applying a log scale increases the correlation coefficient to 0.309.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

fixed.acidity has a relatively strong relationship with several features, such as pH, citric.acid and density.

What was the strongest relationship you found?

The strongest relationship is between pH and fixed.acidity (-0.682).

Multivariate Plots Section


Relationship with Wine Quality

Alcohol + Volatile Acidity

Now let’s look at the two variables with the strongest correlations with quality plotted against each other and colored by quality.

From this plot we see that, in general, wines with a higher alcohol content and a lower volatile acidity concentration tend to be better.

Volatile Acidity + Sulphates

Next, we’ll create a similar plot to examine volatile acidity and sulphates, colored by quality.

We see that more sulphates combined with a lower volatile acidity concentration tends to produce better wines. Compared with low and medium quality wines (rated as 3 to 6), this trend is less obvious in high quality wines (rated as 7 or 8).

Alcohol + Sulphates

From this plot we can see that a higher alcohol content combined with a higher sulphates concentration tends to produce higher quality wines.

Relationships of Other Features

Let’s have a look at the combination of pH, fixed.acidity and citric.acid. Together they account for the two strongest correlations among all features.

This is a much more typical linear relationship. The trend is clear: the lower the pH, the higher the fixed acidity concentration and the higher the citric acid concentration.

Multivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of the relationships from this part of the analysis are consistent with what is seen in the earlier sections.

Were there any interesting or surprising interactions between features?

It looks like a very low sulphates concentration almost completely prevents a wine from achieving a high quality rating. On the other hand, there are some highly rated wines with very low alcohol content, and even some with slightly high volatile acidity.

OPTIONAL: Did you create any models with your dataset?

I didn’t, because I think none of the relationships seems strong enough to build a model on.

Final Plots and Summary


Plot One: Effect of acids on wine quality

Description One

For our samples, the effect of pH and fixed acidity on wine quality was very slight. On the other hand, volatile acidity and citric acid had relatively strong correlations with wine quality. As the volatile acidity concentration increased, the wine quality tended to be lower. As the citric acid concentration increased, the quality tended to be higher.

Plot Two: Effect of alcohol on wine quality

Description Two

For our samples, alcohol had the strongest correlation with quality (0.476). As the alcohol content increased, the quality of the wine tended to increase as well. The wines rated as 3 all had alcohol content less than or equal to 11%, while roughly 75% of the high quality wines (rated as 7 or 8) had alcohol content greater than 11%.

Plot Three: What makes good wines good?

Description Three

With medium quality wines removed from the data, we see a clearer pattern: highly rated wines are concentrated in the area of higher alcohol content and lower volatile acidity. In other words, the combination of high alcohol content and low volatile acidity tended to produce better wines.

Reflection


The red wine data set contains 1,599 observations of 11 chemical properties, and it was provided in a clean format without any missing data. My goal was to find out which chemical properties influence the quality of red wines.

I started by examining each of the features to get a feel for the data set and the ranges of values. As a result, I found that most features had skewed, long-tailed distributions. I also noticed the high concentration of wines in the middle of the quality range, which means our samples contain many more average wines than excellent or poor ones. This troubled me, since I could not figure out whether the long tails in the distributions were outliers or just the result of uneven sampling. After some research, I came to realize that in the real world there are many more average wines, so this is how the data is supposed to look. We should regard these long tails as outliers, because they do not help in reasoning about the pattern.

Next, I explored the relationships between features. Not surprisingly, there was no single strong correlation between quality and any other feature, but some features did seem to be more influential than others. This makes sense, since wine quality is much more complex than, say, diamond price, which is dominated by size (carat).

Most of my visualizations in this project focused on the 4 features that have the highest correlation coefficients with quality: alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226). I also explored two of the features with the weakest correlations with quality, pH (-0.058) and residual.sugar (0.014), and tried to understand the reason.

During the exploration, plots started looking messy with so many quality scores. So I created a categorical variable rating, classifying the wines as low (rating 0 to 4), medium (rating 5 and 6), and high (rating 7 to 10).

In the end, with medium quality wines removed from the final visualization, I could see a clearer pattern: highly rated wines are concentrated in the area of higher alcohol content and lower volatile acidity. In other words, the combination of high alcohol content and low volatile acidity tended to produce better wines.

As for improvements, I think the data set is quite limited, with only 11 chemical properties. It would be great if other variables, such as grape type and wine age, could be included for further investigation.

Reference